The main contribution of logistic regression is to extend the theoretical framework of simple and multiple linear regression to problems where the model's output variable is discrete or categorical rather than continuous.
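As a minimal sketch of the idea (the data here are simulated, not taken from our database), a logistic model in R is a GLM with the binomial family:

```r
# Minimal logistic-regression sketch on simulated data
# (x and y are illustrative; they are not columns of our data set).
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))

# Predicted probabilities P(y = 1 | x), always inside (0, 1)
p_hat <- predict(fit, type = "response")
```

The logit link maps the linear predictor onto (0, 1), which is what lets the regression machinery handle a categorical outcome.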
Our business objective is to build departmental structures, and for that we mainly need to define two large groups: on one side, high-purchasing-power ("high price") homes, and on the other, the rest. We will rely on logistic regression, which requires a 1/0 target in which houses labeled 1 are the high-price homes and 0 is everything else. How do we choose this cut-off point? We will load our database and study the price variable to find a suitable cut-off for the analysis.
#setwd("C:/Users/Pablo/Desktop/Machine_Learning_I/Z_PRACTICA_MACHINE_LEARNING/machineLearning1Process")
df_cluster <- read.csv("../cluster.csv")
df_root <- read.csv("../kc_house_data.csv")
df_cluster$hclust=as.numeric(df_cluster$hclust)
df_cluster$cluster_final[df_cluster$hclust==2 | df_cluster$hclust==8] <- "top"
df_cluster$cluster_final[df_cluster$hclust==1 | df_cluster$hclust==5 | df_cluster$hclust==9] <- "low"
df_cluster$cluster_final[df_cluster$hclust==3 | df_cluster$hclust==4 | df_cluster$hclust==6 | df_cluster$hclust==7 ] <- "med"
Before setting a cut-off to create a dichotomous variable and fit a GLM, we will fit a regression tree to study the behavior of the price variable, whose distribution is clearly not normal; it looks closer to a Gamma.
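As a sketch of that step (assuming the rpart package is available; the data below are a simulated stand-in for df_cluster):

```r
library(rpart)

# Simulated stand-in for df_cluster, with a price response driven by
# two predictors that exist in the kc_house_data schema.
set.seed(1)
d <- data.frame(sqft_living = runif(500, 400, 5000),
                grade       = sample(3:13, 500, replace = TRUE))
d$price <- 150 * d$sqft_living + 30000 * d$grade + rnorm(500, sd = 50000)

# method = "anova" fits a regression tree on the continuous response
tree <- rpart(price ~ sqft_living + grade, data = d, method = "anova")
printcp(tree)  # complexity table: how much each split improves the fit
```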
df_cluster$hclust<-as.factor(df_cluster$hclust)
ggplot(data=df_cluster, aes(x=price, group=hclust ,fill=hclust)) +
geom_density(adjust=1.5)
df_cluster$cluster<-as.factor(df_cluster$cluster)
ggplot(data=df_cluster, aes(x=price, group=cluster ,fill=cluster)) +
geom_density(adjust=1.5)
densidad <- density(df_cluster$price)
plot(densidad, main="Density plot of the house price variable", xlim=c(0,4000000))
polygon(densidad, col="red")
describe(df_cluster$price)
## df_cluster$price
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 3622 1 540297 329526 210000 245000
## .25 .50 .75 .90 .95
## 322000 450000 645000 887000 1160000
##
## lowest : 78000 80000 81000 82000 82500
## highest: 5350000 5570000 6890000 7060000 7700000
histograma <- ggplot(df_cluster, aes(x=price)) +
ggtitle("House prices") +
theme_fivethirtyeight() +
geom_histogram(color="#28324a", fill="#3c78d8")
histograma
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The price distribution is strongly right-skewed: there are many extreme values that could distort any analysis based on the mean. The cut-off point we will use is the 75th percentile of the price variable. The target variable will be 1 when the price is above that 75th percentile and 0 otherwise.
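The cut-off logic can be sketched directly (prices here are simulated for illustration; on the real data the 75th percentile is the 645000 used below):

```r
# Illustrative: derive a binary target from the 75th percentile of price
set.seed(1)
price <- rlnorm(1000, meanlog = 13, sdlog = 0.5)  # right-skewed, like house prices

cutoff <- quantile(price, 0.75)            # 75th percentile
target <- ifelse(price > cutoff, "1", "0")

table(target)  # about 75% zeros, 25% ones
```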
The box plot shows that the population of cluster 1, the expensive homes, has many extreme values. For a cleaner analysis we should remove them, but they matter for our business, so we will keep them and see whether we can still obtain a good prediction.
target1 <- filter(df_cluster, price > 645000)
summary(target1)
## X.2 X.1 X Y
## Min. : 6 Min. : 6 Min. :-29.306 Min. :-28.849
## 1st Qu.: 5308 1st Qu.: 5308 1st Qu.:-12.556 1st Qu.:-16.927
## Median :11034 Median :11034 Median : -6.444 Median : -5.068
## Mean :10972 Mean :10972 Mean : -6.725 Mean : -3.355
## 3rd Qu.:16512 3rd Qu.:16512 3rd Qu.: -0.709 3rd Qu.: 9.205
## Max. :21591 Max. :21591 Max. : 27.464 Max. : 29.774
##
## id date price bedrooms
## Min. :1.200e+06 6/26/2014: 40 Min. : 645500 Min. : 1.000
## 1st Qu.:1.939e+09 6/20/2014: 39 1st Qu.: 723500 1st Qu.: 3.000
## Median :3.886e+09 3/25/2015: 37 Median : 826000 Median : 4.000
## Mean :4.509e+09 4/23/2015: 37 Mean : 987839 Mean : 3.824
## 3rd Qu.:7.301e+09 7/14/2014: 37 3rd Qu.:1050000 3rd Qu.: 4.000
## Max. :9.839e+09 4/27/2015: 36 Max. :7700000 Max. :10.000
## (Other) :5145
## bathrooms sqft_living sqft_lot floors
## Min. :0.750 Min. : 890 Min. : 520 Min. :1.000
## 1st Qu.:2.250 1st Qu.: 2330 1st Qu.: 5400 1st Qu.:1.000
## Median :2.500 Median : 2880 Median : 8580 Median :2.000
## Mean :2.675 Mean : 2999 Mean : 21363 Mean :1.713
## 3rd Qu.:3.250 3rd Qu.: 3510 3rd Qu.: 14348 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## waterfront view condition grade
## Min. :0.00000 Min. :0.0000 Min. :1.000 Min. : 5.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 8.000
## Median :0.00000 Median :0.0000 Median :3.000 Median : 9.000
## Mean :0.02514 Mean :0.6258 Mean :3.462 Mean : 8.844
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:10.000
## Max. :1.00000 Max. :4.0000 Max. :5.000 Max. :13.000
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 580 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1760 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :2430 Median : 0.0 Median :1981 Median : 0.0
## Mean :2534 Mean : 464.9 Mean :1973 Mean : 158.2
## 3rd Qu.:3160 3rd Qu.: 890.0 3rd Qu.:2001 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 860
## 1st Qu.:98033 1st Qu.:47.57 1st Qu.:-122.3 1st Qu.:2020
## Median :98074 Median :47.63 Median :-122.2 Median :2570
## Mean :98074 Mean :47.62 Mean :-122.2 Mean :2619
## 3rd Qu.:98115 3rd Qu.:47.67 3rd Qu.:-122.1 3rd Qu.:3130
## Max. :98199 Max. :47.78 Max. :-121.7 Max. :6210
##
## sqft_lot15 cluster hclust Cluster_final cluster_final
## Min. : 967 1:1461 8 :1550 low: 977 Length:5371
## 1st Qu.: 5250 2: 687 1 : 849 med:2319 Class :character
## Median : 8416 3:3223 3 : 750 top:2075 Mode :character
## Mean : 16300 4 : 627
## 3rd Qu.: 12912 2 : 525
## Max. :871200 7 : 496
## (Other): 574
target0 <- filter(df_cluster, price <= 645000)
summary(target0)
## X.2 X.1 X Y
## Min. : 1 Min. : 1 Min. :-30.601 Min. :-28.717
## 1st Qu.: 5418 1st Qu.: 5418 1st Qu.: -4.623 1st Qu.:-10.053
## Median :10720 Median :10720 Median : 2.784 Median : 1.745
## Mean :10742 Mean :10742 Mean : 2.226 Mean : 1.110
## 3rd Qu.:16089 3rd Qu.:16089 3rd Qu.: 11.198 3rd Qu.: 11.682
## Max. :21597 Max. :21597 Max. : 27.537 Max. : 30.132
##
## id date price bedrooms
## Min. :1.000e+06 6/23/2014: 111 Min. : 78000 Min. : 1.000
## 1st Qu.:2.155e+09 6/25/2014: 95 1st Qu.:290000 1st Qu.: 3.000
## Median :3.905e+09 8/26/2014: 94 Median :385000 Median : 3.000
## Mean :4.604e+09 7/8/2014 : 93 Mean :392154 Mean : 3.224
## 3rd Qu.:7.312e+09 4/14/2015: 92 3rd Qu.:491838 3rd Qu.: 4.000
## Max. :9.900e+09 4/22/2015: 91 Max. :645000 Max. :33.000
## (Other) :15650
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 572 Min. :1.000
## 1st Qu.:1.500 1st Qu.:1310 1st Qu.: 5000 1st Qu.:1.000
## Median :2.000 Median :1700 Median : 7450 Median :1.000
## Mean :1.931 Mean :1776 Mean : 13026 Mean :1.422
## 3rd Qu.:2.500 3rd Qu.:2160 3rd Qu.: 9950 3rd Qu.:2.000
## Max. :7.500 Max. :5461 Max. :1164794 Max. :3.500
##
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 3.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.001726 Mean :0.1047 Mean :3.393 Mean : 7.265
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :11.000
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1120 1st Qu.: 0.0 1st Qu.:1952 1st Qu.: 0.00
## Median :1400 Median : 0.0 Median :1972 Median : 0.00
## Mean :1542 Mean : 234.4 Mean :1970 Mean : 60.06
## 3rd Qu.:1850 3rd Qu.: 450.0 3rd Qu.:1994 3rd Qu.: 0.00
## Max. :5450 Max. :2196.0 Max. :2015 Max. :2015.00
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98032 1st Qu.:47.42 1st Qu.:-122.3 1st Qu.:1410
## Median :98065 Median :47.54 Median :-122.3 Median :1690
## Mean :98079 Mean :47.54 Mean :-122.2 Mean :1777
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2080
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :4362
##
## sqft_lot15 cluster hclust Cluster_final cluster_final
## Min. : 651 1:5794 1 :3781 low:8993 Length:16226
## 1st Qu.: 5066 2:5000 9 :3079 med:5906 Class :character
## Median : 7500 3:5432 5 :2133 top:1327 Mode :character
## Mean : 11586 3 :2109
## 3rd Qu.: 9600 7 :1466
## Max. :438213 6 :1344
## (Other):2314
df_cluster$cluster=as.numeric(df_cluster$hclust)
df_cluster$target[df_cluster$price> 645000] <- '1'
df_cluster$target[df_cluster$price<= 645000] <- '0'
table(df_cluster$target)
##
## 0 1
## 16226 5371
df_cluster$cluster<-as.factor(df_cluster$target)
ggplot(data=df_cluster, aes(x=price, group=target ,fill=target)) +
ggtitle("House price by target") +
geom_density(adjust=1.5)
df_cluster %>%
ggplot( aes(x=target, y=price, fill=target)) +
ggtitle("House price by target") +
geom_violin() +
theme(legend.position="none") +
xlab("")
## EDA by cluster

We now run a descriptive analysis of the variables as a function of the cluster.
df_target_1 = filter(df_cluster, target == "1")
# First rows of the data set
kable(head(df_target_1)) %>%
kable_styling() %>%
scroll_box(width = "100%", height = TRUE)
| X.2 | X.1 | X | Y | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | cluster | hclust | Cluster_final | cluster_final | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | -2.654252 | -26.7202890 | 7237550310 | 5/12/2014 | 1230000 | 4 | 4.50 | 5420 | 101930 | 1.0 | 0 | 0 | 3 | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 | 1 | 4 | med | med | 1 |
| 11 | 11 | 8.887572 | -0.5260397 | 1736800520 | 4/3/2015 | 662500 | 3 | 2.50 | 3560 | 9796 | 1.0 | 0 | 0 | 3 | 8 | 1860 | 1700 | 1965 | 0 | 98007 | 47.6007 | -122.145 | 2210 | 8925 | 1 | 3 | med | med | 1 |
| 16 | 16 | -12.593827 | -1.7041182 | 9297300055 | 1/24/2015 | 650000 | 4 | 3.00 | 2950 | 5000 | 2.0 | 0 | 3 | 3 | 9 | 1980 | 970 | 1979 | 0 | 98126 | 47.5714 | -122.375 | 2140 | 4000 | 1 | 7 | med | med | 1 |
| 22 | 22 | -5.461244 | -0.5643586 | 2524049179 | 8/26/2014 | 2000000 | 3 | 2.75 | 3050 | 44867 | 1.0 | 0 | 4 | 3 | 9 | 2330 | 720 | 1968 | 0 | 98040 | 47.5316 | -122.233 | 4110 | 20336 | 1 | 3 | med | med | 1 |
| 27 | 27 | -11.462589 | 22.0585529 | 1794500383 | 6/26/2014 | 937000 | 3 | 1.75 | 2450 | 2691 | 2.0 | 0 | 0 | 3 | 8 | 1750 | 700 | 1915 | 0 | 98119 | 47.6386 | -122.360 | 1760 | 3573 | 1 | 1 | low | low | 1 |
| 28 | 28 | -7.759261 | 24.3302470 | 3303700376 | 12/1/2014 | 667000 | 3 | 1.00 | 1400 | 1581 | 1.5 | 0 | 0 | 5 | 8 | 1400 | 0 | 1909 | 0 | 98112 | 47.6221 | -122.314 | 1860 | 3861 | 1 | 1 | low | low | 1 |
# Summary table with the main statistics
kable(summary(df_target_1)) %>%
kable_styling() %>%
scroll_box(width = "100%", height = TRUE)
| X.2 | X.1 | X | Y | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | cluster | hclust | Cluster_final | cluster_final | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 6 | Min. : 6 | Min. :-29.306 | Min. :-28.849 | Min. :1.200e+06 | 6/26/2014: 40 | Min. : 645500 | Min. : 1.000 | Min. :0.750 | Min. : 890 | Min. : 520 | Min. :1.000 | Min. :0.00000 | Min. :0.0000 | Min. :1.000 | Min. : 5.000 | Min. : 580 | Min. : 0.0 | Min. :1900 | Min. : 0.0 | Min. :98001 | Min. :47.16 | Min. :-122.5 | Min. : 860 | Min. : 967 | 0: 0 | 8 :1550 | low: 977 | Length:5371 | Length:5371 | |
| 1st Qu.: 5308 | 1st Qu.: 5308 | 1st Qu.:-12.556 | 1st Qu.:-16.927 | 1st Qu.:1.939e+09 | 6/20/2014: 39 | 1st Qu.: 723500 | 1st Qu.: 3.000 | 1st Qu.:2.250 | 1st Qu.: 2330 | 1st Qu.: 5400 | 1st Qu.:1.000 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | 1st Qu.:3.000 | 1st Qu.: 8.000 | 1st Qu.:1760 | 1st Qu.: 0.0 | 1st Qu.:1951 | 1st Qu.: 0.0 | 1st Qu.:98033 | 1st Qu.:47.57 | 1st Qu.:-122.3 | 1st Qu.:2020 | 1st Qu.: 5250 | 1:5371 | 1 : 849 | med:2319 | Class :character | Class :character | |
| Median :11034 | Median :11034 | Median : -6.444 | Median : -5.068 | Median :3.886e+09 | 3/25/2015: 37 | Median : 826000 | Median : 4.000 | Median :2.500 | Median : 2880 | Median : 8580 | Median :2.000 | Median :0.00000 | Median :0.0000 | Median :3.000 | Median : 9.000 | Median :2430 | Median : 0.0 | Median :1981 | Median : 0.0 | Median :98074 | Median :47.63 | Median :-122.2 | Median :2570 | Median : 8416 | NA | 3 : 750 | top:2075 | Mode :character | Mode :character | |
| Mean :10972 | Mean :10972 | Mean : -6.725 | Mean : -3.355 | Mean :4.509e+09 | 4/23/2015: 37 | Mean : 987839 | Mean : 3.824 | Mean :2.675 | Mean : 2999 | Mean : 21363 | Mean :1.713 | Mean :0.02514 | Mean :0.6258 | Mean :3.462 | Mean : 8.844 | Mean :2534 | Mean : 464.9 | Mean :1973 | Mean : 158.2 | Mean :98074 | Mean :47.62 | Mean :-122.2 | Mean :2619 | Mean : 16300 | NA | 4 : 627 | NA | NA | NA | |
| 3rd Qu.:16512 | 3rd Qu.:16512 | 3rd Qu.: -0.709 | 3rd Qu.: 9.205 | 3rd Qu.:7.301e+09 | 7/14/2014: 37 | 3rd Qu.:1050000 | 3rd Qu.: 4.000 | 3rd Qu.:3.250 | 3rd Qu.: 3510 | 3rd Qu.: 14348 | 3rd Qu.:2.000 | 3rd Qu.:0.00000 | 3rd Qu.:0.0000 | 3rd Qu.:4.000 | 3rd Qu.:10.000 | 3rd Qu.:3160 | 3rd Qu.: 890.0 | 3rd Qu.:2001 | 3rd Qu.: 0.0 | 3rd Qu.:98115 | 3rd Qu.:47.67 | 3rd Qu.:-122.1 | 3rd Qu.:3130 | 3rd Qu.: 12912 | NA | 2 : 525 | NA | NA | NA | |
| Max. :21591 | Max. :21591 | Max. : 27.464 | Max. : 29.774 | Max. :9.839e+09 | 4/27/2015: 36 | Max. :7700000 | Max. :10.000 | Max. :8.000 | Max. :13540 | Max. :1651359 | Max. :3.500 | Max. :1.00000 | Max. :4.0000 | Max. :5.000 | Max. :13.000 | Max. :9410 | Max. :4820.0 | Max. :2015 | Max. :2015.0 | Max. :98199 | Max. :47.78 | Max. :-121.7 | Max. :6210 | Max. :871200 | NA | 7 : 496 | NA | NA | NA | |
| NA | NA | NA | NA | NA | (Other) :5145 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | (Other): 574 | NA | NA | NA |
df_target_0 = filter(df_cluster, target == "0")
# First rows of the data set
kable(head(df_target_0)) %>%
kable_styling() %>%
scroll_box(width = "100%", height = TRUE)
| X.2 | X.1 | X | Y | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | cluster | hclust | Cluster_final | cluster_final | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3.7621642 | 22.445021 | 7129300520 | 10/13/2014 | 221900 | 3 | 1.00 | 1180 | 5650 | 1 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 | 0 | 1 | low | low | 0 |
| 2 | 2 | -24.0359753 | 9.766915 | 6414100192 | 12/9/2014 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 | 0 | 2 | top | top | 0 |
| 3 | 3 | 5.2170153 | 8.102678 | 5631500400 | 2/25/2015 | 180000 | 2 | 1.00 | 770 | 10000 | 1 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 | 0 | 3 | med | med | 0 |
| 4 | 4 | 7.5887297 | 18.967730 | 2487200875 | 12/9/2014 | 604000 | 4 | 3.00 | 1960 | 5000 | 1 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 | 0 | 1 | low | low | 0 |
| 5 | 5 | 0.6661815 | 3.939548 | 1954400510 | 2/18/2015 | 510000 | 3 | 2.00 | 1680 | 8080 | 1 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 | 0 | 3 | med | med | 0 |
| 7 | 7 | 15.5619202 | -13.926208 | 1321400060 | 6/27/2014 | 257500 | 3 | 2.25 | 1715 | 6819 | 2 | 0 | 0 | 3 | 7 | 1715 | 0 | 1995 | 0 | 98003 | 47.3097 | -122.327 | 2238 | 6819 | 0 | 5 | low | low | 0 |
# Summary table with the main statistics
kable(summary(df_target_0)) %>%
kable_styling() %>%
scroll_box(width = "100%", height = TRUE)
| X.2 | X.1 | X | Y | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | cluster | hclust | Cluster_final | cluster_final | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1 | Min. : 1 | Min. :-30.601 | Min. :-28.717 | Min. :1.000e+06 | 6/23/2014: 111 | Min. : 78000 | Min. : 1.000 | Min. :0.500 | Min. : 370 | Min. : 572 | Min. :1.000 | Min. :0.000000 | Min. :0.0000 | Min. :1.000 | Min. : 3.000 | Min. : 370 | Min. : 0.0 | Min. :1900 | Min. : 0.00 | Min. :98001 | Min. :47.16 | Min. :-122.5 | Min. : 399 | Min. : 651 | 0:16226 | 1 :3781 | low:8993 | Length:16226 | Length:16226 | |
| 1st Qu.: 5418 | 1st Qu.: 5418 | 1st Qu.: -4.623 | 1st Qu.:-10.053 | 1st Qu.:2.155e+09 | 6/25/2014: 95 | 1st Qu.:290000 | 1st Qu.: 3.000 | 1st Qu.:1.500 | 1st Qu.:1310 | 1st Qu.: 5000 | 1st Qu.:1.000 | 1st Qu.:0.000000 | 1st Qu.:0.0000 | 1st Qu.:3.000 | 1st Qu.: 7.000 | 1st Qu.:1120 | 1st Qu.: 0.0 | 1st Qu.:1952 | 1st Qu.: 0.00 | 1st Qu.:98032 | 1st Qu.:47.42 | 1st Qu.:-122.3 | 1st Qu.:1410 | 1st Qu.: 5066 | 1: 0 | 9 :3079 | med:5906 | Class :character | Class :character | |
| Median :10720 | Median :10720 | Median : 2.784 | Median : 1.745 | Median :3.905e+09 | 8/26/2014: 94 | Median :385000 | Median : 3.000 | Median :2.000 | Median :1700 | Median : 7450 | Median :1.000 | Median :0.000000 | Median :0.0000 | Median :3.000 | Median : 7.000 | Median :1400 | Median : 0.0 | Median :1972 | Median : 0.00 | Median :98065 | Median :47.54 | Median :-122.3 | Median :1690 | Median : 7500 | NA | 5 :2133 | top:1327 | Mode :character | Mode :character | |
| Mean :10742 | Mean :10742 | Mean : 2.226 | Mean : 1.110 | Mean :4.604e+09 | 7/8/2014 : 93 | Mean :392154 | Mean : 3.224 | Mean :1.931 | Mean :1776 | Mean : 13026 | Mean :1.422 | Mean :0.001726 | Mean :0.1047 | Mean :3.393 | Mean : 7.265 | Mean :1542 | Mean : 234.4 | Mean :1970 | Mean : 60.06 | Mean :98079 | Mean :47.54 | Mean :-122.2 | Mean :1777 | Mean : 11586 | NA | 3 :2109 | NA | NA | NA | |
| 3rd Qu.:16089 | 3rd Qu.:16089 | 3rd Qu.: 11.198 | 3rd Qu.: 11.682 | 3rd Qu.:7.312e+09 | 4/14/2015: 92 | 3rd Qu.:491838 | 3rd Qu.: 4.000 | 3rd Qu.:2.500 | 3rd Qu.:2160 | 3rd Qu.: 9950 | 3rd Qu.:2.000 | 3rd Qu.:0.000000 | 3rd Qu.:0.0000 | 3rd Qu.:4.000 | 3rd Qu.: 8.000 | 3rd Qu.:1850 | 3rd Qu.: 450.0 | 3rd Qu.:1994 | 3rd Qu.: 0.00 | 3rd Qu.:98118 | 3rd Qu.:47.68 | 3rd Qu.:-122.1 | 3rd Qu.:2080 | 3rd Qu.: 9600 | NA | 7 :1466 | NA | NA | NA | |
| Max. :21597 | Max. :21597 | Max. : 27.537 | Max. : 30.132 | Max. :9.900e+09 | 4/22/2015: 91 | Max. :645000 | Max. :33.000 | Max. :7.500 | Max. :5461 | Max. :1164794 | Max. :3.500 | Max. :1.000000 | Max. :4.0000 | Max. :5.000 | Max. :11.000 | Max. :5450 | Max. :2196.0 | Max. :2015 | Max. :2015.00 | Max. :98199 | Max. :47.78 | Max. :-121.3 | Max. :4362 | Max. :438213 | NA | 6 :1344 | NA | NA | NA | |
| NA | NA | NA | NA | NA | (Other) :15650 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | (Other):2314 | NA | NA | NA |
Analysis of the "price" variable (sale price).
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = price))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=price))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
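The same boxplot-plus-histogram pair is rebuilt below for each numeric variable. As a sketch, the pattern could be wrapped once (the helper name plot_num_var is ours, not part of the original code):

```r
library(ggplot2)
library(ggpubr)   # for ggarrange()

# Hypothetical helper: boxplot + histogram of a numeric column, by target.
plot_num_var <- function(df, var, palette = c("#E7B800", "#2E9FDF", "#FC4E07")) {
  bxp <- ggplot(df, aes(x = target, y = .data[[var]])) +
    geom_boxplot(aes(color = target)) +
    scale_color_manual(values = palette)
  gh <- ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(aes(color = target), bins = 30) +
    scale_color_manual(values = palette)
  ggarrange(bxp, gh, labels = c("Boxplot", "Histogram"), ncol = 2, nrow = 1)
}

# Usage: plot_num_var(df_cluster, "sqft_living")
```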
Analysis of the "sqft_living" and "sqft_living15" variables (living area). "sqft_living" is the interior living area of the home in square feet (the registered surface).
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = sqft_living))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=sqft_living))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
"sqft_lot15" variable: lot area as of 2015.
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = sqft_lot15))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=sqft_lot15))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
Analysis of the "sqft_above" variable (interior area above ground level, i.e. excluding the basement).
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = sqft_above))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=sqft_above))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
Analysis of the "sqft_basement" variable (below-grade, i.e. basement, area).
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = sqft_basement))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=sqft_basement))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
Analysis of the "yr_built" variable (year the home was built).
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = yr_built))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=yr_built))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
Analysis of the "yr_renovated" variable (year the home was last renovated).
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
df_cluster$target <- as.factor(df_cluster$target)
pBoxPlot <- ggplot(df_cluster, aes(x = target, y = yr_renovated))
bxp <- pBoxPlot + geom_boxplot(aes(color = target)) +
scale_color_manual(values = my3cols)
geomHist <- ggplot(data=df_cluster, aes(x=yr_renovated))
gh <- geomHist + geom_histogram(aes(color = target)) +
scale_color_manual(values = my3cols)
figure <- ggarrange(bxp, gh,
labels = c("Boxplot", "Histogram"),
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure
Analysis of the "bathrooms" variable (number of bathrooms per home):
var_bathrooms_1 = df_target_1$bathrooms
var_bathrooms_0 = df_target_0$bathrooms
name_1 = "target_1 bathrooms"
name_2 = "target_0 bathrooms"
my3cols <- c("#E7B800", "#2E9FDF", "#FC4E07")
# Summary statistics for each target group
summary(var_bathrooms_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.750 2.250 2.500 2.675 3.250 8.000
summary(var_bathrooms_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 1.500 2.000 1.931 2.500 7.500
pb1<-ggplot(df_target_1, aes(unlist(var_bathrooms_1), fill=unlist(var_bathrooms_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_bathrooms_0), fill=unlist(var_bathrooms_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
out = histbackback(list(Target_1 = df_cluster %>% filter(target == "1") %>% filter(!is.na(bathrooms)) %>% .$bathrooms,
Target_0 = df_cluster %>% filter(target == "0") %>% filter(!is.na(bathrooms)) %>% .$bathrooms),
probability = TRUE, main = "Comparison of the bathrooms variable")
# Color the left and right halves of the plot
barplot(-out$left, col="blue" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="red", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
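The back-to-back comparison is repeated for each discrete variable below; as a sketch, a reusable wrapper could look like this (the helper name compare_by_target is ours, assuming the Hmisc package that provides histbackback is loaded):

```r
library(Hmisc)   # histbackback()

# Hypothetical helper: back-to-back histogram of a column, split by target.
compare_by_target <- function(df, var) {
  x1 <- df[df$target == "1" & !is.na(df[[var]]), var]
  x0 <- df[df$target == "0" & !is.na(df[[var]]), var]
  out <- histbackback(list(Target_1 = x1, Target_0 = x0),
                      probability = TRUE,
                      main = paste("Comparison of", var, "by target"))
  # Color the left (target 1) and right (target 0) halves of the plot
  barplot(-out$left, col = "blue", horiz = TRUE, space = 0, add = TRUE, axes = FALSE)
  barplot(out$right, col = "red",  horiz = TRUE, space = 0, add = TRUE, axes = FALSE)
  invisible(out)
}

# Usage: compare_by_target(df_cluster, "bathrooms")
```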
Analysis of the "bedrooms" variable (number of bedrooms per home):
var_bedrooms_1 = df_target_1$bedrooms
var_bedrooms_0 = df_target_0$bedrooms
name_1 = "target_1 bedrooms"
name_2 = "target_0 bedrooms"
my3cols <- c("#E7B800", "#2E9FDF")
# Summary statistics for each target group
summary(var_bedrooms_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.824 4.000 10.000
summary(var_bedrooms_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.224 4.000 33.000
pb1<-ggplot(df_target_1, aes(unlist(var_bedrooms_1), fill=unlist(var_bedrooms_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_bedrooms_0), fill=unlist(var_bedrooms_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
out = histbackback(list(Target_1 = df_cluster %>% filter(target == "1") %>% filter(!is.na(bedrooms)) %>% .$bedrooms,
Target_0 = df_cluster %>% filter(target == "0") %>% filter(!is.na(bedrooms)) %>% .$bedrooms),
probability = TRUE, main = "Comparison of the bedrooms variable")
# Color the left and right halves of the plot
barplot(-out$left, col="blue" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="red", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
Analysis of the "floors" variable (number of floors per home):
var_floors_1 = df_target_1$floors
var_floors_0 = df_target_0$floors
name_1 = "target_1 floors"
name_2 = "target_0 floors"
my3cols <- c("#E7B800", "#2E9FDF")
# Summary statistics for each target group
summary(var_floors_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.713 2.000 3.500
summary(var_floors_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.422 2.000 3.500
pb1<-ggplot(df_target_1, aes(unlist(var_floors_1), fill=unlist(var_floors_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_floors_0), fill=unlist(var_floors_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
out = histbackback(list(Target_1 = df_cluster %>% filter(target == "1") %>% filter(!is.na(floors)) %>% .$floors,
Target_0 = df_cluster %>% filter(target == "0") %>% filter(!is.na(floors)) %>% .$floors),
probability = TRUE, main = "Comparison of the floors variable")
# Color the left and right halves of the plot
barplot(-out$left, col="blue" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="red", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
Analysis of the "condition" variable (condition of the home, rated 1 to 5):
var_condition_1 = df_target_1$condition
var_condition_0 = df_target_0$condition
name_1 = "target_1 condition"
name_2 = "target_0 condition"
my3cols <- c("#E7B800", "#2E9FDF")
# Summary statistics for each target group
summary(var_condition_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.462 4.000 5.000
summary(var_condition_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.393 4.000 5.000
pb1<-ggplot(df_target_1, aes(unlist(var_condition_1), fill=unlist(var_condition_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_condition_0), fill=unlist(var_condition_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
out = histbackback(list(Target_1 = df_cluster %>% filter(target == "1") %>% filter(!is.na(condition)) %>% .$condition,
Target_0 = df_cluster %>% filter(target == "0") %>% filter(!is.na(condition)) %>% .$condition),
probability = TRUE, main = "Comparison of the condition variable")
# Color the left and right halves of the plot
barplot(-out$left, col="blue" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="red", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
Analysis of the "waterfront" variable (homes facing a large body of water):
var_waterfront_1 = df_target_1$waterfront
var_waterfront_0 = df_target_0$waterfront
name_1 = "target_1 waterfront"
name_2 = "target_0 waterfront"
my3cols <- c("#E7B800", "#2E9FDF")
# Summary statistics for each target group
summary(var_waterfront_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02514 0.00000 1.00000
summary(var_waterfront_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.001726 0.000000 1.000000
pb1<-ggplot(df_target_1, aes(unlist(var_waterfront_1), fill=unlist(var_waterfront_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_waterfront_0), fill=unlist(var_waterfront_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
Analysis of the "grade" variable (overall grade of the home, rated 1 to 13):
var_grade_1 = df_target_1$grade
var_grade_0 = df_target_0$grade
name_1 = "target_1 grade"
name_2 = "target_0 grade"
my3cols <- c("#E7B800", "#2E9FDF")
# Summary statistics for each target group
summary(var_grade_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 8.000 9.000 8.844 10.000 13.000
summary(var_grade_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 7.000 7.000 7.265 8.000 11.000
pb1<-ggplot(df_target_1, aes(unlist(var_grade_1), fill=unlist(var_grade_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_grade_0), fill=unlist(var_grade_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
out = histbackback(list(Target_1 = df_cluster %>% filter(target == "1") %>% filter(!is.na(grade)) %>% .$grade,
Target_0 = df_cluster %>% filter(target == "0") %>% filter(!is.na(grade)) %>% .$grade),
probability = TRUE, main = "Comparativa Variable grade")
# Color the left and right halves of the back-to-back histogram
barplot(-out$left, col="blue" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="red", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
Study of the variable “view” (an index, from 0 to 4, of how good the views from the property are):
var_view_1 = df_target_1$view
var_view_0 = df_target_0$view
name_1 = "target_1 view"
name_2 = "target_0 view"
my3cols <- c("#E7B800", "#2E9FDF")
# Summary statistics for view in each target group
summary(var_view_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.6258 0.0000 4.0000
summary(var_view_0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1047 0.0000 4.0000
pb1<-ggplot(df_target_1, aes(unlist(var_view_1), fill=unlist(var_view_1))) +
geom_bar(position="dodge", fill='blue', color="blue") +
labs(x= name_1, y = 'Frecuencia', fill=NULL)
pb2<-ggplot(df_target_0, aes(unlist(var_view_0), fill=unlist(var_view_0))) +
geom_bar(position="dodge", fill='red', color="red") +
labs(x= name_2, y = 'Frecuencia', fill=NULL)
figure <- ggarrange(pb1, pb2)
figure
out = histbackback(list(Target_1 = df_cluster %>% filter(target == "1") %>% filter(!is.na(view)) %>% .$view,
Target_0 = df_cluster %>% filter(target == "0") %>% filter(!is.na(view)) %>% .$view),
probability = TRUE, main = "Comparativa Variable view")
# Color the left and right halves of the back-to-back histogram
barplot(-out$left, col="blue" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="red", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
summary(df_cluster)
## X.2 X.1 X Y
## Min. : 1 Min. : 1 Min. :-30.6010 Min. :-28.84906
## 1st Qu.: 5400 1st Qu.: 5400 1st Qu.: -7.3212 1st Qu.:-12.29488
## Median :10799 Median :10799 Median : 0.2695 Median : 0.09209
## Mean :10799 Mean :10799 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.:16198 3rd Qu.:16198 3rd Qu.: 8.5906 3rd Qu.: 11.16195
## Max. :21597 Max. :21597 Max. : 27.5371 Max. : 30.13228
##
## id date price bedrooms
## Min. :1.000e+06 6/23/2014: 142 Min. : 78000 Min. : 1.000
## 1st Qu.:2.123e+09 6/25/2014: 131 1st Qu.: 322000 1st Qu.: 3.000
## Median :3.905e+09 6/26/2014: 131 Median : 450000 Median : 3.000
## Mean :4.580e+09 7/8/2014 : 127 Mean : 540297 Mean : 3.373
## 3rd Qu.:7.309e+09 4/27/2015: 126 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 3/25/2015: 123 Max. :7700000 Max. :33.000
## (Other) :20817
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.116 Mean : 2080 Mean : 15099 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10685 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.00 Min. : 3.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.00 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.00 Median : 7.000
## Mean :0.007547 Mean :0.2343 Mean :3.41 Mean : 7.658
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.00 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.00 Max. :13.000
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
## Median :1560 Median : 0.0 Median :1975 Median : 0.00
## Mean :1789 Mean : 291.7 Mean :1971 Mean : 84.46
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.00
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
##
## sqft_lot15 cluster hclust Cluster_final
## Min. : 651 0:16226 1 :4630 low:9970
## 1st Qu.: 5100 1: 5371 9 :3110 med:8225
## Median : 7620 3 :2859 top:3402
## Mean : 12758 8 :2370
## 3rd Qu.: 10083 5 :2230
## Max. :871200 7 :1962
## (Other):4436
## cluster_final target
## Length:21597 0:16226
## Class :character 1: 5371
## Mode :character
##
##
##
##
The following descriptive plot shows high and moderate correlations between several variables, so we will need to address them with Ridge and Lasso regularization.
pm <- ggpairs(
df_cluster[, c( 7,8,9,10,11,17,18,24,25,30)],
ggplot2::aes(colour=target))
pm
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We set a random seed and partition the data into Train (70%), Validation (20%) and Test (10%).
set.seed(737)
#inTraining <- createDataPartition(df_cluster$id, p = .7, list = FALSE)
#train <- df_cluster[inTraining,]
#control <- df_cluster[-inTraining,]
split_data <- function(data, trn = .7, val = .2, tst = .1) {
  set.seed(737)
  spec = c(train = trn, validate = val, test = tst)
  # assign each row to a partition according to the cumulative proportions
  g = sample(cut(seq(nrow(data)), nrow(data)*cumsum(c(0,spec)), labels = names(spec)))
  # split the data frame by partition label
  data <- split(data, g)
  return(data)
}
data <- split_data(df_cluster, 0.7, 0.2, 0.1)
train <- data$train
control <- data$test
validation<- data$validate
table(train$target)
##
## 0 1
## 11364 3753
table(control$target)
##
## 0 1
## 1618 542
table(validation$target)
##
## 0 1
## 3244 1076
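As a quick sanity check, split_data() can be exercised on a dummy data frame to confirm that the partitions are disjoint and respect the requested proportions (a minimal sketch; the toy data frame is invented for illustration, and the helper is repeated so the snippet runs standalone):

```r
# helper repeated from above so this sketch is self-contained
split_data <- function(data, trn = .7, val = .2, tst = .1) {
  set.seed(737)
  spec = c(train = trn, validate = val, test = tst)
  g = sample(cut(seq(nrow(data)), nrow(data) * cumsum(c(0, spec)), labels = names(spec)))
  split(data, g)
}
# dummy data frame with 1000 rows
toy <- data.frame(id = 1:1000)
parts <- split_data(toy, 0.7, 0.2, 0.1)
sapply(parts, nrow)   # train 700, validate 200, test 100
```

Because cut() assigns rows by the cumulative break points before sample() shuffles the labels, the partition sizes are exact, not approximate.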
While still minimizing the RSS, ridge regression adds a term called the shrinkage penalty that forces the predictors' coefficients toward zero, controlled by the parameter λ. When λ = 0 the penalty vanishes and the results are equivalent to ordinary least squares; as λ → ∞ all coefficients shrink to zero. Its main advantage is variance reduction. When all the included predictors contribute to the model with coefficients of roughly the same magnitude, ridge regression tends to perform best.
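The ridge objective can be written explicitly (standard formulation, mirroring the lasso expression given later):

```latex
\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^{2} \;=\; \text{RSS}+\lambda\sum_{j=1}^{p}\beta_j^{2}
```

The only difference from lasso is the squared (ℓ2) penalty instead of the absolute-value (ℓ1) penalty, which is why ridge shrinks coefficients without ever setting them exactly to zero.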
Ridge regression is fitted with the glmnet() function from the glmnet package.
# design matrix with the predictor values for each observation, and a response vector y = target
x <- model.matrix(target~ (bedrooms+ bathrooms+floors+sqft_living+grade+condition+view+waterfront+ sqft_lot+sqft_above+sqft_basement+yr_built+yr_renovated+
sqft_living15+sqft_lot15+Cluster_final), data = train)[, -1]
head(x)
## bedrooms bathrooms floors sqft_living grade condition view waterfront
## 1 3 1.0 1 1180 7 3 0 0
## 4 4 3.0 1 1960 7 5 0 0
## 5 3 2.0 1 1680 8 3 0 0
## 6 4 4.5 1 5420 11 3 0 0
## 8 3 1.5 1 1060 7 3 0 0
## 9 3 1.0 1 1780 7 3 0 0
## sqft_lot sqft_above sqft_basement yr_built yr_renovated sqft_living15
## 1 5650 1180 0 1955 0 1340
## 4 5000 1050 910 1965 0 1360
## 5 8080 1680 0 1987 0 1800
## 6 101930 3890 1530 2001 0 4760
## 8 9711 1060 0 1963 0 1650
## 9 7470 1050 730 1960 0 1780
## sqft_lot15 Cluster_finalmed Cluster_finaltop
## 1 5650 0 0
## 4 5000 0 0
## 5 7503 1 0
## 6 101930 1 0
## 8 9711 1 0
## 9 8113 1 0
y <- train$target
y <- as.integer(y)
# alpha = 0 fits a ridge regression
modelos_ridge <- glmnet(x = x, y = y, alpha = 0)
plot(modelos_ridge, xvar = "lambda", label = TRUE)
As λ grows, the coefficients shrink. To identify the value of λ that yields the best model we can use cross-validation. The function cv.glmnet() computes the CV test error, using k = 10 folds by default.
set.seed(737)
cv_error_ridge <- cv.glmnet(x = x, y = y, alpha = 0, nfolds = 15,
type.measure = "mse")
plot(cv_error_ridge)
We can observe how the mean squared error varies with the regularization strength. Graphically, the error does not increase until log(λ) exceeds roughly −2, and the smallest mean squared error is obtained with all 17 regressors kept in the model, staying essentially constant. Since ridge never sets coefficients exactly to zero, this is one of the big differences with respect to Lasso.
# lambda value achieving the minimum CV test error
cv_error_ridge$lambda.min
## [1] 0.02542122
# Optimal lambda: the largest lambda whose test error stays
# within 1 standard error of the minimum test error.
cv_error_ridge$lambda.1se
## [1] 0.07073607
# Coefficients of the model fitted with the optimal lambda
modelo_final_ridge <- glmnet(x = x, y = y, alpha = 0, lambda = cv_error_ridge$lambda.1se)
coef(modelo_final_ridge)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 6.103194e+00
## bedrooms -6.975913e-03
## bathrooms 1.508731e-02
## floors 3.821222e-02
## sqft_living 5.687788e-05
## grade 9.519157e-02
## condition 4.068419e-02
## view 4.076669e-02
## waterfront 1.292870e-02
## sqft_lot 4.959385e-08
## sqft_above 5.792261e-05
## sqft_basement 4.164063e-05
## yr_built -3.164416e-03
## yr_renovated -1.651793e-05
## sqft_living15 7.787366e-05
## sqft_lot15 -3.856229e-07
## Cluster_finalmed 7.737745e-02
## Cluster_finaltop 1.686181e-01
The lasso method, like ridge regression, forces the coefficient estimates of the predictors toward zero. The difference is that lasso can set some of them exactly to zero, which, besides reducing variance, performs predictor selection. $$\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}|\beta_j| \;=\; \text{RSS}+\lambda\sum_{j=1}^{p}|\beta_j|$$
When only a small subset of the included predictors have substantial coefficients and the rest are very small or zero, lasso produces better models.
Selection of the tuning parameter λ
To determine the degree of penalization, we select a grid of λ values and estimate the cross-validation error for each one; finally we pick the λ with the smallest error and refit the model, this time using all the observations.
modelos_lasso <- glmnet(x = x, y = y, alpha = 1)
plot(modelos_lasso, xvar = "lambda", label = TRUE)
set.seed(737)
cv_error_lasso <- cv.glmnet(x = x, y = y, alpha = 1, nfolds = 10)
plot(cv_error_lasso)
Again we can observe how the mean squared error varies with the regularization strength. Graphically, the error does not increase until log(λ) exceeds roughly −4, while the smallest mean squared error is obtained with 3 regressors.
cv_error_lasso$lambda.min
## [1] 0.0004143026
cv_error_lasso$lambda.1se
## [1] 0.008925879
# Refit the model on all observations using the optimal
# lambda value
modelo_final_lasso <- glmnet(x = x, y = y, alpha = 1, lambda = cv_error_lasso$lambda.1se)
coef(modelo_final_lasso)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 6.595800e+00
## bedrooms .
## bathrooms .
## floors 2.666460e-02
## sqft_living 9.143041e-05
## grade 1.300344e-01
## condition 2.862539e-02
## view 3.238449e-02
## waterfront .
## sqft_lot .
## sqft_above 6.118704e-06
## sqft_basement .
## yr_built -3.487305e-03
## yr_renovated .
## sqft_living15 6.692732e-05
## sqft_lot15 -6.026306e-09
## Cluster_finalmed 7.061614e-02
## Cluster_finaltop 1.527260e-01
The advantage of the final lasso model is that it is much simpler, since it keeps only ‘n’ predictors. Next, we fit a regression model with this λ on the significant variables. Note, however, from the error plot that we could obtain a model with only 10 variables whose error is very similar; to do so, we look for the λ value at which the first set of 10 variables appears.
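One way to locate such a λ is the nzero component of a cv.glmnet fit, which records how many coefficients are nonzero at each λ along the path. A minimal self-contained sketch on synthetic data (the matrix X, coefficient vector b and response Y below are invented for illustration; on the real data one would query cv_error_lasso$nzero instead):

```r
library(glmnet)
set.seed(737)
# synthetic design: 200 observations, 17 predictors, only 10 truly active
X <- matrix(rnorm(200 * 17), ncol = 17)
b <- c(rep(1.5, 10), rep(0, 7))
Y <- as.numeric(X %*% b + rnorm(200))
cvfit <- cv.glmnet(X, Y, alpha = 1)
nz <- cvfit$nzero   # nonzero-coefficient count at each lambda on the path
# largest lambda (strongest penalty) at which exactly 10 predictors survive, if any
lambda_10 <- if (any(nz == 10)) max(cvfit$lambda[nz == 10]) else NA
```

The resulting lambda_10 can then be passed to glmnet(..., lambda = lambda_10) to refit the 10-variable model.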
par(mfrow = c(1,2))
plot(cv_error_ridge,ylab = "Mean Square Error ridge regression" )
abline(h = 120000)
plot(cv_error_lasso,ylab = "Mean Square Error lasso")
abline(h = 120000)
par(mfrow = c(1,1))
Using the variables selected by Lasso:
train_glm1 = glm(target ~ floors + grade + condition + view + sqft_above + yr_built + sqft_living15 + Cluster_final ,
family = binomial,
data = train )
summary(train_glm1)
##
## Call:
## glm(formula = target ~ floors + grade + condition + view + sqft_above +
## yr_built + sqft_living15 + Cluster_final, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3215 -0.4129 -0.2028 -0.0266 3.4445
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.017e+01 2.628e+00 26.702 < 2e-16 ***
## floors 4.191e-01 7.004e-02 5.984 2.17e-09 ***
## grade 1.450e+00 4.678e-02 30.993 < 2e-16 ***
## condition 5.717e-01 4.777e-02 11.967 < 2e-16 ***
## view 2.564e-01 3.473e-02 7.383 1.54e-13 ***
## sqft_above 8.171e-04 6.321e-05 12.927 < 2e-16 ***
## yr_built -4.541e-02 1.411e-03 -32.176 < 2e-16 ***
## sqft_living15 6.844e-04 6.490e-05 10.546 < 2e-16 ***
## Cluster_finalmed 1.511e+00 7.487e-02 20.183 < 2e-16 ***
## Cluster_finaltop 1.542e+00 8.458e-02 18.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16943.7 on 15116 degrees of freedom
## Residual deviance: 8370.4 on 15107 degrees of freedom
## AIC: 8390.4
##
## Number of Fisher Scoring iterations: 6
Looking at the logistic-regression output, we can state that every variable included in the model is statistically significant, although for floors and view the test statistic is not particularly large, which suggests they are not especially explanatory.
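The z value that summary() reports for each coefficient is the Wald statistic, i.e. the estimate divided by its standard error, compared against a standard normal distribution:

```latex
z_j \;=\; \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}
```

This is why a small |z| (as for floors and view relative to grade or yr_built) signals a weaker contribution even when the p-value is below 0.05.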
train_glm2 = update(train_glm1, . ~ . - view - floors) # Drop two predictors
anova(train_glm1, train_glm2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: target ~ floors + grade + condition + view + sqft_above + yr_built +
## sqft_living15 + Cluster_final
## Model 2: target ~ grade + condition + sqft_above + yr_built + sqft_living15 +
## Cluster_final
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 15107 8370.4
## 2 15109 8461.5 -2 -91.097 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output tells us that removing the two predictors worsens the fit to a statistically significant degree; they are also important from a business standpoint and were significant in the first model, so we keep them.
head(predict(train_glm1, type = "response")) # Probabilities on the response scale
## 1 4 5 6 8 9
## 0.01171684 0.02107990 0.09943194 0.99523562 0.04019918 0.04946034
summary(train_glm1$fitted.values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000455 0.0199307 0.0755619 0.2482635 0.3757317 0.9999988
The confusion matrix shows that the model has a notable problem with false negatives: it predicts a substantial number of high-price houses as low-price. This has business implications, since what we want to avoid is misclassifying the high-price houses.
predictions <- predict(train_glm1, train,type='response')
plot.roc(train$target, predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
table(pred = predictions > 0.5, obs = train$target)
## obs
## pred 0 1
## FALSE 10758 1187
## TRUE 606 2566
data = as.numeric(predictions>0.5)
data=as.factor(data)
y_test=as.factor(train$target)
# use caret and compute a confusion matrix
confusionMatrix(data, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10758 1187
## 1 606 2566
##
## Accuracy : 0.8814
## 95% CI : (0.8761, 0.8865)
## No Information Rate : 0.7517
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6649
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9467
## Specificity : 0.6837
## Pos Pred Value : 0.9006
## Neg Pred Value : 0.8090
## Prevalence : 0.7517
## Detection Rate : 0.7116
## Detection Prevalence : 0.7902
## Balanced Accuracy : 0.8152
##
## 'Positive' Class : 0
##
# Check the ROC curve with the ROCit library
library(ROCit)
##
## Attaching package: 'ROCit'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:car':
##
## logit
ROCit_obj <- rocit(score = predictions, class = train$target)
plot(ROCit_obj)
As can be seen, the ROC curve bends toward the 90° corner rather than the 45° diagonal, which indicates that, despite the false negatives, the classifier is fairly accurate.
We should also examine the AUC, which summarizes the ROC curve in a single figure.
require(ROCR)
predictions <- predict(train_glm1, train,type='response')
pred<- prediction(predictions, train$target)
gain <- performance(pred, "tpr", "rpp")
plot(gain, main = "Gain Chart Train Population")
abline(a=0,b=1)
The chart shows that with roughly 70% of the population we already capture about 98% of the responses; this would allow a cost saving of around 30% while losing only 2% of the responses.
# lift chart
perf <- performance(pred,"lift","rpp")
plot(perf, main="lift curve")
# Difference between the logit and probit links
X=seq(from=-4,to=4,by=0.1)
sigmoide=1/(1+exp(-X))
cumulative<-pnorm(X, 0, 1)
plot(sigmoide,type="l",col="red")
lines(cumulative,col="blue")
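Explicitly, the two curves plotted above are the logistic (sigmoid) link and the probit link, where Φ denotes the standard normal CDF:

```latex
p(X) \;=\; \frac{1}{1+e^{-X}} \quad\text{(logit)}, \qquad p(X) \;=\; \Phi(X) \quad\text{(probit)}
```

Both map the linear predictor to (0, 1); the probit has slightly lighter tails, which is why the two fits below produce very similar, but not identical, coefficients and deviances.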
model_logit1 = glm(target ~ floors + grade + condition + view + sqft_above + yr_built + sqft_living15 + Cluster_final ,
family = binomial (link="logit") ,
data = train )
summary(model_logit1)
##
## Call:
## glm(formula = target ~ floors + grade + condition + view + sqft_above +
## yr_built + sqft_living15 + Cluster_final, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3215 -0.4129 -0.2028 -0.0266 3.4445
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.017e+01 2.628e+00 26.702 < 2e-16 ***
## floors 4.191e-01 7.004e-02 5.984 2.17e-09 ***
## grade 1.450e+00 4.678e-02 30.993 < 2e-16 ***
## condition 5.717e-01 4.777e-02 11.967 < 2e-16 ***
## view 2.564e-01 3.473e-02 7.383 1.54e-13 ***
## sqft_above 8.171e-04 6.321e-05 12.927 < 2e-16 ***
## yr_built -4.541e-02 1.411e-03 -32.176 < 2e-16 ***
## sqft_living15 6.844e-04 6.490e-05 10.546 < 2e-16 ***
## Cluster_finalmed 1.511e+00 7.487e-02 20.183 < 2e-16 ***
## Cluster_finaltop 1.542e+00 8.458e-02 18.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16943.7 on 15116 degrees of freedom
## Residual deviance: 8370.4 on 15107 degrees of freedom
## AIC: 8390.4
##
## Number of Fisher Scoring iterations: 6
train$prediccion=predict(model_logit1,type="response")
Pred_auxiliar= prediction(train$prediccion, train$target, label.ordering = NULL)
CURVA_ROC_model_logit1_train <- performance(Pred_auxiliar,"tpr","fpr")
plot(CURVA_ROC_model_logit1_train)
abline(a=0,b=1)
## Model performance
#mean(as.numeric(train$target)-1)
#aggregate(train$prediccion~train$target,FUN=mean)
table(pred = train$prediccion > 0.25, obs = train$target)
## obs
## pred 0 1
## FALSE 9787 625
## TRUE 1577 3128
data = as.numeric(train$prediccion>0.25)
data=as.factor(data)
y_test=as.factor(train$target)
# use caret and compute a confusion matrix
confusionMatrix(data, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9787 625
## 1 1577 3128
##
## Accuracy : 0.8543
## 95% CI : (0.8486, 0.8599)
## No Information Rate : 0.7517
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6403
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8612
## Specificity : 0.8335
## Pos Pred Value : 0.9400
## Neg Pred Value : 0.6648
## Prevalence : 0.7517
## Detection Rate : 0.6474
## Detection Prevalence : 0.6888
## Balanced Accuracy : 0.8473
##
## 'Positive' Class : 0
##
cutoffs <- seq(0.1,0.9,0.1)
accuracy <- NULL
for (i in seq(along = cutoffs)){
prediction <- ifelse(model_logit1$fitted.values >= cutoffs[i], 1, 0) #Predicting for cut-off
accuracy <- c(accuracy, length(which(as.numeric(as.character(train$target)) == prediction))/length(prediction)*100)
}
plot(cutoffs, accuracy, pch =19,type='b',col= "steelblue",
main ="Logistic Regression", xlab="Cutoff Level", ylab = "Accuracy %")
We now compute the overall accuracy, i.e. the error rate, across cutoffs: for a cutoff of 0.5 we obtain an error rate of 11.86%.
## a function for error rate
get_Error_Rate = function(trues, predicted_prb, cutoff){
preds=ifelse(predicted_prb<cutoff,0,1)
tab=table(preds, trues)
round((tab[1,2]+tab[2,1])/sum(tab), 4)
}
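The cutoff sweep can also be written compactly with sapply. A toy example (trues and probs are invented values chosen so both predicted classes occur at every cutoff; get_Error_Rate is repeated from above so the snippet runs standalone):

```r
# helper repeated from above so this sketch is self-contained
get_Error_Rate = function(trues, predicted_prb, cutoff){
  preds = ifelse(predicted_prb < cutoff, 0, 1)
  tab = table(preds, trues)
  round((tab[1,2] + tab[2,1]) / sum(tab), 4)
}
# hypothetical ground truth and predicted probabilities
trues <- c(0, 0, 1, 1, 1, 0)
probs <- c(0.05, 0.40, 0.95, 0.60, 0.30, 0.70)
sapply(seq(0.1, 0.9, 0.1), function(ct) get_Error_Rate(trues, probs, ct))
```

With these toy values the error rate at cutoff 0.5 is 0.3333 (2 of 6 observations misclassified). Note that the tab[1,2] + tab[2,1] indexing assumes both predicted classes appear at every cutoff; otherwise table() returns a single row and the subscript fails.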
get_Error_Rate(train$target,model_logit1$fitted.values, 0.1)
## [1] 0.235
get_Error_Rate(train$target,model_logit1$fitted.values, 0.2)
## [1] 0.1643
get_Error_Rate(train$target,model_logit1$fitted.values, 0.3)
## [1] 0.132
get_Error_Rate(train$target,model_logit1$fitted.values, 0.4)
## [1] 0.1201
get_Error_Rate(train$target,model_logit1$fitted.values, 0.5)
## [1] 0.1186
get_Error_Rate(train$target,model_logit1$fitted.values, 0.6)
## [1] 0.1207
get_Error_Rate(train$target,model_logit1$fitted.values, 0.7)
## [1] 0.1279
get_Error_Rate(train$target,model_logit1$fitted.values, 0.8)
## [1] 0.1438
get_Error_Rate(train$target,model_logit1$fitted.values, 0.9)
## [1] 0.1651
#plot(get_Error_Rate(train$target,model_logit1$fitted.values, 0.1:1.0))
predictionsLogit <- predict(model_logit1, validation,type='response')
plot.roc(validation$target, predictionsLogit)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
table(pred = predictionsLogit > 0.5, obs = validation$target)
## obs
## pred 0 1
## FALSE 3035 366
## TRUE 209 710
data = as.numeric(predictionsLogit>0.5)
data=as.factor(data)
y_test=as.factor(validation$target)
# use caret and compute a confusion matrix
confusionMatrix(data, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3035 366
## 1 209 710
##
## Accuracy : 0.8669
## 95% CI : (0.8564, 0.8769)
## No Information Rate : 0.7509
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6259
##
## Mcnemar's Test P-Value : 7.736e-11
##
## Sensitivity : 0.9356
## Specificity : 0.6599
## Pos Pred Value : 0.8924
## Neg Pred Value : 0.7726
## Prevalence : 0.7509
## Detection Rate : 0.7025
## Detection Prevalence : 0.7873
## Balanced Accuracy : 0.7977
##
## 'Positive' Class : 0
##
train$yr_built=as.numeric(train$yr_built)
model_probit1 = glm(target ~ floors + grade + condition + view + sqft_above + yr_built + sqft_living15 + Cluster_final ,
family = binomial (link="probit") ,
data = train )
summary(model_probit1)
##
## Call:
## glm(formula = target ~ floors + grade + condition + view + sqft_above +
## yr_built + sqft_living15 + Cluster_final, family = binomial(link = "probit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6703 -0.4286 -0.1774 -0.0035 3.8798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.889e+01 1.420e+00 27.396 < 2e-16 ***
## floors 2.177e-01 3.788e-02 5.747 9.10e-09 ***
## grade 7.977e-01 2.510e-02 31.781 < 2e-16 ***
## condition 3.029e-01 2.619e-02 11.566 < 2e-16 ***
## view 1.508e-01 1.964e-02 7.676 1.64e-14 ***
## sqft_above 4.483e-04 3.477e-05 12.894 < 2e-16 ***
## yr_built -2.512e-02 7.551e-04 -33.269 < 2e-16 ***
## sqft_living15 3.942e-04 3.601e-05 10.947 < 2e-16 ***
## Cluster_finalmed 8.194e-01 4.000e-02 20.483 < 2e-16 ***
## Cluster_finaltop 8.614e-01 4.680e-02 18.404 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16943.7 on 15116 degrees of freedom
## Residual deviance: 8362.9 on 15107 degrees of freedom
## AIC: 8382.9
##
## Number of Fisher Scoring iterations: 6
train$prediccion=predict(model_probit1,type="response")
Pred_auxiliar= prediction(train$prediccion, train$target, label.ordering = NULL)
CURVA_ROC_model_probit1_train <- performance(Pred_auxiliar,"tpr","fpr")
plot(CURVA_ROC_model_probit1_train)
abline(a=0,b=1)
## Model performance
#mean(as.numeric(train$target)-1)
#aggregate(train$prediccion~train$target,FUN=mean)
table(pred = train$prediccion > 0.25, obs = train$target)
## obs
## pred 0 1
## FALSE 9672 583
## TRUE 1692 3170
data = as.numeric(train$prediccion>0.25)
data=as.factor(data)
y_test=as.factor(train$target)
# use caret and compute a confusion matrix
confusionMatrix(data, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9672 583
## 1 1692 3170
##
## Accuracy : 0.8495
## 95% CI : (0.8437, 0.8552)
## No Information Rate : 0.7517
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6331
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8511
## Specificity : 0.8447
## Pos Pred Value : 0.9431
## Neg Pred Value : 0.6520
## Prevalence : 0.7517
## Detection Rate : 0.6398
## Detection Prevalence : 0.6784
## Balanced Accuracy : 0.8479
##
## 'Positive' Class : 0
##
We again compute the overall accuracy, i.e. the error rate, across cutoffs: for a cutoff of 0.5 we obtain an error rate of 11.83%.
## a function for error rate
get_Error_Rate = function(trues, predicted_prb, cutoff){
preds=ifelse(predicted_prb<cutoff,0,1)
tab=table(preds, trues)
round((tab[1,2]+tab[2,1])/sum(tab), 4)
}
get_Error_Rate(train$target,model_probit1$fitted.values, 0.1)
## [1] 0.2436
get_Error_Rate(train$target,model_probit1$fitted.values, 0.2)
## [1] 0.1727
get_Error_Rate(train$target,model_probit1$fitted.values, 0.3)
## [1] 0.1354
get_Error_Rate(train$target,model_probit1$fitted.values, 0.4)
## [1] 0.1212
get_Error_Rate(train$target,model_probit1$fitted.values, 0.5)
## [1] 0.1183
get_Error_Rate(train$target,model_probit1$fitted.values, 0.6)
## [1] 0.1207
get_Error_Rate(train$target,model_probit1$fitted.values, 0.7)
## [1] 0.131
get_Error_Rate(train$target,model_probit1$fitted.values, 0.8)
## [1] 0.1478
get_Error_Rate(train$target,model_probit1$fitted.values, 0.9)
## [1] 0.169
#plot(get_Error_Rate(train$target,model_logit1$fitted.values, 0.1:1.0))
predictionsProbit <- predict(model_probit1, validation,type='response')
plot.roc(validation$target, predictionsProbit)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
table(pred = predictionsProbit > 0.5, obs = validation$target)
## obs
## pred 0 1
## FALSE 3041 372
## TRUE 203 704
data = as.numeric(predictionsProbit>0.5)
data=as.factor(data)
y_test=as.factor(validation$target)
# use caret and compute a confusion matrix
confusionMatrix(data, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3041 372
## 1 203 704
##
## Accuracy : 0.8669
## 95% CI : (0.8564, 0.8769)
## No Information Rate : 0.7509
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6245
##
## Mcnemar's Test P-Value : 2.451e-12
##
## Sensitivity : 0.9374
## Specificity : 0.6543
## Pos Pred Value : 0.8910
## Neg Pred Value : 0.7762
## Prevalence : 0.7509
## Detection Rate : 0.7039
## Detection Prevalence : 0.7900
## Balanced Accuracy : 0.7958
##
## 'Positive' Class : 0
##